Section 3 describes the development of explanatory variables (features) using the Polymer class.
Polymetrics is imported together with the following supporting libraries:
import pandas as pd
import numpy as np
import Polymetrics as poly
import FileImport
import polymetrics_config
import traceback
import plotly.express as px
from IPython.display import display
The data used in the example is taken from a patent US2013/0046061 (Hermel-Davidock et al.). The patent lists four inventive samples (IS) and four comparative samples (CS). The distinction between inventive and comparative samples is made using a novel descriptor - Comonomer Distribution Constant (CDC). The inventive PE samples show CDC values greater than 45. CDC is calculated solely using CEF plots and is primarily dependent on how the weight fraction data is distributed around its centroid. The lower the spread, the higher the CDC value. The spread in the data can also be determined independently by other statistical quantities such as standard deviation (STDEV), coefficient of variation (COV), median absolute deviation (MedianAD), interquartile range (IQR), etc.
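All of these spread measures are standard statistics and can be computed directly with NumPy. Below is a minimal sketch on an illustrative weight-fraction array (the values are made up for demonstration, not CEF data from the patent):

```python
import numpy as np

# Illustrative weight-fraction values (made-up numbers, not from the patent)
w = np.array([0.2, 1.5, 3.1, 2.8, 1.1, 0.4])

mean = w.mean()
stdev = w.std(ddof=1)                             # sample standard deviation
cov = stdev / mean                                # coefficient of variation
q1, q3 = np.percentile(w, [25, 75])
iqr = q3 - q1                                     # interquartile range
mad = np.mean(np.abs(w - mean))                   # mean absolute deviation
median_ad = np.median(np.abs(w - np.median(w)))   # median absolute deviation
```

A narrower distribution shrinks all of these measures, which is why they can serve alongside CDC as descriptors of the spread in a CEF plot.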
In this notebook, we shall see how these features can be developed in Polymetrics. The feature selection method is also discussed.
The XLSXImport function prepares the imported polymer objects for further processing.
df_in = FileImport.XLSXImport("Article/Example_Dataset.xlsx", sheet_name = 'Data')
Polymer objects are usually described by their specifications and physical properties. Alternatively, one can derive specialized attributes of polymers from thermoanalytical, chromatographic, and other characterization data. Such analytical data tends to resolve the microstructure better and can be used effectively to describe polymer characteristics.
# Filtering relevant data for the analysis
df_pat = df_in[(df_in['Project'] == 'US20130046061') & (df_in['Type'] == 'Resin_Developmental')]
The BasicStats function takes CEF plots and interpolates the data over a specified temperature range split into a fixed number of intervals. Statistical analysis is carried out on the interpolated array, and the values are returned as a dictionary. Presently, the function reports statistical quantities geared towards describing the spread in the data (standard deviation, interquartile range, coefficient of variation, and absolute deviations).
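To make the interpolate-then-summarize idea concrete, here is a hypothetical re-implementation of a BasicStats-style routine in plain NumPy/pandas. The function name, the column names 'Temperature' and 'w', and the trapezoidal AUC are assumptions for illustration; the actual Polymetrics internals may differ:

```python
import numpy as np
import pandas as pd

def basic_stats_sketch(df_cef, minT=30.0, maxT=110.0, n_points=200):
    """Sketch of a BasicStats-style routine: interpolate a CEF curve onto a
    fixed temperature grid and report spread statistics as a dict.
    Column names 'Temperature' and 'w' are assumed, not the library's API."""
    grid = np.linspace(minT, maxT, n_points)                  # fixed temperature grid
    w = np.interp(grid, df_cef['Temperature'], df_cef['w'])   # interpolated weight fraction
    mean, median = w.mean(), np.median(w)
    stdev = w.std(ddof=1)
    q1, q3 = np.percentile(w, [25, 75])
    return {
        'Mean': mean,
        'STDEV': stdev,
        'COV': stdev / mean,
        'Median': median,
        'IQR': q3 - q1,
        'MAD': np.mean(np.abs(w - mean)),
        'MedianAD': np.median(np.abs(w - median)),
        # trapezoidal area under the interpolated curve
        'AUC': float(np.sum((w[1:] + w[:-1]) / 2 * np.diff(grid))),
    }

# Toy triangular CEF-like curve peaking at 70 °C
toy = pd.DataFrame({'Temperature': [30, 50, 70, 90, 110],
                    'w': [0.0, 1.0, 2.0, 1.0, 0.0]})
stats = basic_stats_sketch(toy)
```

The returned keys mirror the feature names that appear in the results table later in this section.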
In the code below, the BasicStats function is applied to each polymer object in turn, and the features it returns are stored in FeaturesEng_df. In the patent dataset there are four inventive samples (IS) and four comparative samples (CS); CEF data is provided for all the resins except CS4. In the absence of a CEF plot, BasicStats cannot generate descriptors for sample CS4, so a try/except block catches the error and lets execution continue. The additional features developed from the experimental data can then be combined with the rest of the variables to form a features dataset.
FeaturesEng_df = pd.DataFrame(columns = df_pat.index) # Original index-related information is retained. Note that the
                                                      # polymer objects are arranged column-wise to start with.
for i, polymer_ in zip(df_pat.index, df_pat.to_dict(orient="records")):
    PE = poly.Polymer(polymer_)
    try:
        # Feature building
        BasicStats_dict = PE.BasicStats(Interpolate = True, minT = 30.0, maxT = 110.0)
        # In Polymetrics, the feature-generating functions return dictionary objects.
        FeaturesEng_dict = {**BasicStats_dict} # Creates a union of N dictionaries via unpacking.
        FeaturesEng_df.loc[:, i] = pd.Series(FeaturesEng_dict)
    except Exception as error:
        print('Polymetrics Error at', polymer_['Identifier'], repr(error))
        # traceback.print_exc()

FeaturesEng_df = FeaturesEng_df.T # Transpose of the DataFrame for row-wise arrangement of polymer objects.
# The transpose can change the element datatype from float to object; the datatype is cast back to float64.
FeaturesEng_df = FeaturesEng_df.astype('float64')
display(pd.concat([df_pat['Identifier'], FeaturesEng_df], axis = 1))
# print(FeaturesEng_df.dtypes)
Polymetrics Error at CS4 AttributeError("'Polymer' object has no attribute 'df_CEF'")
| | Identifier | Mean | STDEV | COV | Median | IQR | MAD | MedianAD | AUC |
|---|---|---|---|---|---|---|---|---|---|
| 0 | IS1 | 1.548817 | 2.335605 | 1.507993 | 0.202345 | 2.279806 | 1.916741 | 0.218350 | 124.153609 |
| 1 | IS2 | 1.146400 | 3.611261 | 3.150089 | 0.012504 | 0.336941 | 1.864347 | 0.072177 | 91.899543 |
| 2 | IS3 | 1.450821 | 2.758006 | 1.900997 | 0.197935 | 1.426555 | 1.806271 | 0.197935 | 116.298258 |
| 3 | IS4 | 1.220110 | 3.120180 | 2.557294 | 0.012646 | 0.457814 | 1.835769 | 0.019832 | 97.803985 |
| 4 | CS1 | 1.180145 | 1.358352 | 1.151004 | 0.416925 | 2.291558 | 1.206711 | 0.418673 | 94.600780 |
| 5 | CS2 | 1.153728 | 1.191008 | 1.032312 | 1.144547 | 1.581618 | 0.926675 | 0.927813 | 92.483233 |
| 6 | CS3 | 1.117421 | 1.099402 | 0.983874 | 0.813129 | 1.533233 | 0.872619 | 0.757883 | 89.572560 |
| 7 | CS4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
FeaturesEng_df is then appended to the rest of the data in the df_pat DataFrame. A heatmap of pairwise correlations between the variables shows that many features are highly correlated.
df_pat_numeric = df_pat.select_dtypes(np.number) # numeric data is separated from the df_pat
Features = pd.concat([df_pat_numeric, FeaturesEng_df], axis=1) # the newly developed features are appended to the rest of the features
Features.dropna(inplace = True, how = 'any', axis = 0) # Removes CS4
# pairwise correlation is plotted using an interactive heatmap
fig1 = px.imshow(Features.corr(method = 'pearson'),
color_continuous_scale=px.colors.diverging.BrBG,
color_continuous_midpoint=0,
width = 800,
height = 800,
title = 'Correlation Matrix')
fig1.update_layout(font=dict(size = 15))
fig1.show()
Many problems encountered in this domain have few observations and many features, so dimensionality reduction is an important step in preparing the data for analysis. The drop_correlated function gives scientists full control to retain key features relevant to the problem and drop unwanted features showing spurious correlation.
X_data = poly.drop_correlated(Features, coeff = 0.7, Retain = ['COV', 'delHm'], Drop = ['Density', 'Tc', 'delHc', 'Tm'], Plot = False)
# pairwise correlation is plotted using an interactive heatmap
fig2 = px.imshow(X_data.corr(method = 'pearson'),
color_continuous_scale=px.colors.diverging.BrBG,
color_continuous_midpoint=0,
width = 600,
height = 600,
title = 'Correlation Matrix with selected features')
fig2.update_layout(font=dict(size = 15))
fig2.show()
Correlated variables in remaining data to drop ['Mz', 'ZSVR', 'I2', 'I10', 'Unsat_1M_C', 'Mean', 'STDEV', 'Median', 'IQR', 'MAD', 'MedianAD', 'AUC']
Variables correlated with the variables to retain ['CDC', 'Mn']
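The listing above shows which columns were removed. The internals of drop_correlated are not shown in this section, but a helper in the same spirit can be sketched with plain pandas. The function name drop_correlated_sketch and its greedy keep-first strategy are assumptions for illustration, not the library's actual algorithm:

```python
import pandas as pd

def drop_correlated_sketch(df, coeff=0.7, retain=(), drop=()):
    """Illustrative sketch of a drop_correlated-style helper (the actual
    Polymetrics algorithm may differ). Columns in `drop` are removed first;
    columns in `retain` are always kept; every other column is kept only if
    its absolute Pearson correlation with all already-kept columns is <= coeff."""
    df = df.drop(columns=list(drop))
    corr = df.corr(method='pearson').abs()
    keep = list(retain)
    for col in df.columns:
        if col in keep:
            continue
        if all(corr.loc[col, k] <= coeff for k in keep):
            keep.append(col)
    return df[keep]

# Toy example: 'b' duplicates 'a' (|r| = 1), 'c' is only weakly correlated with 'a'
toy = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 4, 6, 8, 10],
                    'c': [5, 1, 4, 2, 3]})
reduced = drop_correlated_sketch(toy, coeff=0.7, retain=['a'])
```

Here 'b' is dropped because it is perfectly correlated with the retained column 'a', while 'c' survives the threshold.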